House Prices in New York

Group CC901E2

The University of Sydney

Introduction

Topics

  • The aim of the model is to predict house prices in New York.

  • The data set was originally sourced from the mosaicData package in R.

  • The data set contains information about New York houses, including the price (in US dollars), the lot size (in acres), the number of bedrooms and bathrooms, and the type of heating system.

Methodology

  • To achieve our aim, we fitted several regression models.
  • The models were then evaluated using RMSE, MAE and adjusted \(r^2\) to find the most useful and appropriate one.

Basic Summary of the Data

Data Structure

  • It is a large data set with 1,734 rows and 16 columns.
  • Of the 16 variables, 10 are numeric and 6 are factors.
  • The test variable was removed from the data set because its meaning is unknown.
Rows: 1,734
Columns: 16
$ Price         <int> 132500, 181115, 109000, 155000, 86060, 120000, 153000, 1…
$ Lot.Size      <dbl> 0.09, 0.92, 0.19, 0.41, 0.11, 0.68, 0.40, 1.21, 0.83, 1.…
$ Waterfront    <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ Age           <int> 42, 0, 133, 13, 0, 31, 33, 23, 36, 4, 123, 1, 13, 153, 9…
$ Land.Value    <int> 50000, 22300, 7300, 18700, 15000, 14000, 23300, 14600, 2…
$ New.Construct <fct> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ Central.Air   <fct> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,…
$ Fuel.Type     <fct> Electric, Gas, Gas, Gas, Gas, Gas, Oil, Oil, Electric, G…
$ Heat.Type     <fct> Electric, Hot Water, Hot Water, Hot Air, Hot Air, Hot Ai…
$ Sewer.Type    <fct> Private, Private, Public, Private, Public, Private, Priv…
$ Living.Area   <int> 906, 1953, 1944, 1944, 840, 1152, 2752, 1662, 1632, 1416…
$ Pct.College   <int> 35, 51, 51, 51, 51, 22, 51, 35, 51, 44, 51, 51, 41, 57, …
$ Bedrooms      <int> 2, 3, 4, 3, 2, 4, 4, 4, 3, 3, 7, 3, 2, 3, 3, 3, 3, 4, 2,…
$ Fireplaces    <int> 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,…
$ Bathrooms     <dbl> 1.0, 2.5, 1.0, 1.5, 1.0, 1.0, 1.5, 1.5, 1.5, 1.5, 1.0, 2…
$ Rooms         <int> 5, 6, 8, 5, 3, 8, 8, 9, 8, 6, 12, 6, 4, 5, 8, 4, 7, 12, …

Mean, SD, Max, Min, etc.

This data set does not contain any missing data; however, it does contain outliers. For example, a 0-acre lot size cannot exist. It is also unlikely that $5,000 USD would be enough to purchase a house in New York in 2006.
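The kind of check described above can be sketched in base R. This is only an illustration: `house_demo` is a hypothetical four-row data frame and the thresholds are assumptions chosen to match the two examples given, not rules used in the analysis.

```r
# Hypothetical mini data frame; column names follow the data structure above.
house_demo <- data.frame(
  Price    = c(132500, 5000, 181115, 109000),
  Lot.Size = c(0.09, 0.50, 0.00, 0.19)
)

# Flag rows with implausible values (illustrative thresholds only).
suspect <- house_demo$Lot.Size <= 0 | house_demo$Price <= 5000

house_demo[suspect, ]  # rows to inspect by hand
sum(suspect)           # number of flagged rows
```

Flagged rows are only candidates for inspection; whether to drop or keep them is a separate decision.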

Variable                 Mean         SD        Max        Min
price (USD)         211545.05   98553.81  775000.00    5000.00
lot size (acres)         0.50       0.70      12.20       0.00
age                     28.26      29.86     225.00       0.00
land value           34536.23   34980.94  412600.00     200.00
pct college             55.57      10.32      82.00      20.00
living area           1752.63     620.22    5228.00     616.00
bedrooms                 3.15       0.82       7.00       1.00
rooms                    7.03       2.32      12.00       2.00

Simple Regression

General view

Variable chosen

Our dependent variable is ‘Price’. Based on the graph, ‘Living.Area’ is the variable most strongly correlated with ‘Price’ (0.71, shown in dark pink), so it is chosen as the predictor.

Dependent variable: Price

Independent variable: Living.Area


Call:
lm(formula = Price ~ Living.Area, data = house)

Coefficients:
(Intercept)  Living.Area  
    12844.2        113.4  

Assumption checking

The errors \(\varepsilon_i\) are iid \(\mathcal{N}(0,\sigma^2)\) and there is a linear relationship between y and x.

Assumption checking

  • Linearity: The auxiliary line in the residuals vs fitted values plot is close to straight, with no obvious curve or pattern, so the linearity assumption appears reasonable.

  • Homoskedasticity: The residuals appear evenly spread and do not fan out or change their variability over the range of the fitted values, so the constant error variance assumption is met.

  • Normality: In the QQ plot, the points are reasonably close to the diagonal line. Although there appear to be some outliers, the data set is relatively large (1,734 observations). Thus, the normality assumption is approximately satisfied.
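For readers who want to reproduce this kind of check, the two diagnostic plots can be drawn with base R's `plot()` method for `lm` objects. The sketch below uses simulated data (`sim` is hypothetical, and its coefficients and noise level are assumptions), not the house data itself.

```r
# Simulated stand-in data; the real analysis uses the house data set.
set.seed(1)
sim <- data.frame(Living.Area = runif(200, 600, 5000))
sim$Price <- 12844 + 113 * sim$Living.Area + rnorm(200, sd = 60000)

fit <- lm(Price ~ Living.Area, data = sim)

op <- par(mfrow = c(1, 2))
plot(fit, which = 1)  # residuals vs fitted: linearity and constant variance
plot(fit, which = 2)  # normal QQ plot: normality of residuals
par(op)
```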

Simple regression model

Conclusion:

As the assumptions are all met, we conclude that our estimated simple model is \(\widehat{Price} = 12844.179 + 113.373 \times Living.Area\)
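As a worked example of using the estimated line, the sketch below plugs a living area of 2000 into the reported coefficients. The helper name `predict_price` and the input value are illustrative assumptions.

```r
# Coefficients as reported for the simple model.
b0 <- 12844.179  # intercept (USD)
b1 <- 113.373    # change in predicted price (USD) per unit of Living.Area

# Hypothetical helper: predicted price for a given living area.
predict_price <- function(living_area) b0 + b1 * living_area

predict_price(2000)
```

The arithmetic is 12844.179 + 113.373 × 2000 = 239590.179, so each additional unit of living area adds about $113 to the predicted price.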

Multiple Regression

Backward AIC

Term                   Estimate
(Intercept)            7740.537
Lot.Size               7372.038
Waterfront1          120327.799
Age                    -140.797
Land.Value                0.920
New.Construct1       -44544.524
Central.Air1           9639.198
Heat.TypeHot Air       9998.556
Heat.TypeHot Water     -511.478
Heat.TypeNone        -32952.329
Living.Area              70.172
Bedrooms              -7797.557
Bathrooms             23048.177
Rooms                  3045.908

Assumption Checking for Backward AIC (1)

Assumption Checking for Backward AIC (2)

Forward AIC

Term                   Estimate
(Intercept)            7740.537
Living.Area              70.172
Land.Value                0.920
Bathrooms             23048.177
Waterfront1          120327.799
New.Construct1       -44544.524
Heat.TypeHot Air       9998.556
Heat.TypeHot Water     -511.478
Heat.TypeNone        -32952.329
Lot.Size               7372.038
Central.Air1           9639.198
Age                    -140.797
Rooms                  3045.908
Bedrooms              -7797.557
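Backward and forward AIC selection of this kind can be run with base R's `step()`. The sketch below is a minimal illustration on simulated data: the data frame `sim`, its four predictors and the deliberately irrelevant `Noise` column are all assumptions, and it is not the exact call used to produce the coefficients above.

```r
# Simulated stand-in data with one irrelevant predictor.
set.seed(2)
n <- 300
sim <- data.frame(
  Living.Area = runif(n, 600, 5000),
  Land.Value  = runif(n, 200, 100000),
  Bathrooms   = sample(c(1, 1.5, 2, 2.5), n, replace = TRUE),
  Noise       = rnorm(n)  # unrelated to Price by construction
)
sim$Price <- 5000 + 70 * sim$Living.Area + 0.9 * sim$Land.Value +
  25000 * sim$Bathrooms + rnorm(n, sd = 50000)

full  <- lm(Price ~ Living.Area + Land.Value + Bathrooms + Noise, data = sim)
null  <- lm(Price ~ 1, data = sim)
upper <- ~ Living.Area + Land.Value + Bathrooms + Noise

# Backward: start from the full model and drop terms while AIC improves.
back <- step(full, direction = "backward", trace = 0)

# Forward: start from the intercept-only model and add terms while AIC improves.
fwd <- step(null, scope = list(lower = ~ 1, upper = upper),
            direction = "forward", trace = 0)

formula(back)
formula(fwd)
```

The two directions list terms in a different order because forward selection adds them in order of AIC improvement, which is why the same coefficients can appear in a different sequence.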

Assumption Checking for Forward AIC (1)

Assumption Checking for Forward AIC (2)

Fitted Model

Price ~ Lot.Size + Waterfront1 + Age + Land.Value + New.Construct1 + Central.Air1 + Heat.TypeHot Air + Heat.TypeHot Water + Heat.TypeNone + Living.Area + Bedrooms + Bathrooms + Rooms

Stable Model

Variable Inclusion Plot

Variable Inclusion Plot: Insight

  • Land.Value, Living.Area, Bathrooms, New.Construct and Waterfront are the five most important variables for predicting Price.
  • The non-monotonic nature of the Heat.TypeHot Air line indicates that a group of other variables contains similar information to it.
  • The paths of Sewer.TypePublic and all levels of Fuel.Type lie below the path of the redundant variable, which means they are included in models only by chance and do not provide any useful information.

Model stability Plot

  • There appear to be dominant models of size two, three and five (including the intercept), as shown by one circle being substantially larger than the other circles for models of the same size.
  • The stepwise model (size = 14) is not stable, as all circles of that size are small.

Model stability Plot

name                                                                prob  logLikelihood
Price~1                                                             1.00      -22398.09
Price~Living.Area                                                   1.00      -21781.28
Price~Land.Value+Living.Area                                        1.00      -21595.32
Price~Land.Value+Living.Area+Bathrooms                              0.56      -21560.23
Price~Waterfront1+Land.Value+Living.Area+Bathrooms                  0.78      -21529.76
Price~Waterfront1+Land.Value+New.Construct1+Living.Area+Bathrooms   0.57      -21513.72
  • The simple regression model \(Price \sim Living.Area\) is stable, as it is always the model of its size selected in bootstrap resamples.
  • The \(Price \sim Waterfront1 + Land.Value + Living.Area + Bathrooms\) model is selected in 78% of bootstrap resamples, which strikes a balance between accuracy and stability.
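The idea behind these bootstrap selection probabilities can be sketched in base R. The slides do not state how the probabilities were computed, so the following is a simplified stand-in: it compares three hypothetical one-predictor models on simulated data and records which has the highest log-likelihood in each resample.

```r
# Simulated stand-in data; Living.Area is the strongest predictor by construction.
set.seed(3)
n <- 300
sim <- data.frame(
  Living.Area = runif(n, 600, 5000),
  Land.Value  = runif(n, 200, 100000),
  Noise       = rnorm(n)
)
sim$Price <- 5000 + 70 * sim$Living.Area + 0.9 * sim$Land.Value +
  rnorm(n, sd = 50000)

# Candidate models of the same size (intercept plus one predictor).
candidates <- list(
  Price ~ Living.Area,
  Price ~ Land.Value,
  Price ~ Noise
)

B <- 100  # number of bootstrap resamples (assumed value)
winner <- character(B)
for (b in seq_len(B)) {
  boot <- sim[sample(n, replace = TRUE), ]  # resample rows with replacement
  ll <- sapply(candidates, function(f) as.numeric(logLik(lm(f, data = boot))))
  winner[b] <- deparse(candidates[[which.max(ll)]])
}

# Share of resamples in which each candidate was the best of its size.
prop.table(table(winner))
```

A model that wins in nearly every resample is stable in the sense used above; a size where the wins are split among several models has no dominant model.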

Assumption Checking for the stable model

Discussion: model evaluations

Which model is better?

Our models - simple model

\[\widehat{Price} = 12844.179 + 113.373 \times Living.Area\]

Our models - backward and forward models

  • Backward stepwise regression model:
Term                   Estimate
(Intercept)            7740.537
Lot.Size               7372.038
Waterfront1          120327.799
Age                    -140.797
Land.Value                0.920
New.Construct1       -44544.524
Central.Air1           9639.198
Heat.TypeHot Air       9998.556
Heat.TypeHot Water     -511.478
Heat.TypeNone        -32952.329
Living.Area              70.172
Bedrooms              -7797.557
Bathrooms             23048.177
Rooms                  3045.908
  • Forward stepwise regression model:
Term                   Estimate
(Intercept)            7740.537
Living.Area              70.172
Land.Value                0.920
Bathrooms             23048.177
Waterfront1          120327.799
New.Construct1       -44544.524
Heat.TypeHot Air       9998.556
Heat.TypeHot Water     -511.478
Heat.TypeNone        -32952.329
Lot.Size               7372.038
Central.Air1           9639.198
Age                    -140.797
Rooms                  3045.908
Bedrooms              -7797.557

Our models - stable model

  • Stable: \(Price \sim Waterfront + Land.Value + Living.Area + Bathrooms\)
Term                   Estimate
(Intercept)            3136.912
Waterfront1          123131.166
Land.Value                0.912
Living.Area              71.031
Bathrooms             27058.455

r-square and adjusted r-square

  • What percentage of the total variation in observed house prices is explained by our models?
    • \(r^2\)
    • Adjusted \(r^2\)
      • Takes the number of predictors in the model into account

  • Compare \(r^2\) and adjusted-\(r^2\) of our models:
Model      r-square   Adjusted r-square
Simple        0.509               0.509
Backward      0.655               0.652
Forward       0.655               0.652
Stable        0.633               0.632
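The adjusted values can be checked from the usual formula \(r^2_{adj} = 1 - (1 - r^2)\frac{n - 1}{n - p - 1}\), where \(n\) is the number of observations and \(p\) the number of predictors. The sketch below assumes \(n = 1734\) (the row count reported earlier) and takes \(p\) as the number of non-intercept coefficients in each model; because the inputs are rounded \(r^2\) values, the results are approximate.

```r
# Adjusted r-squared from r-squared, sample size n and number of predictors p.
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n <- 1734  # assumed: row count of the data set

round(adj_r2(0.509, n, p = 1), 3)   # simple model
round(adj_r2(0.655, n, p = 13), 3)  # backward / forward model
round(adj_r2(0.633, n, p = 4), 3)   # stable model
```

With 13 predictors the penalty is visible in the third decimal place, while with one predictor it disappears after rounding.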

Error rate

  • How far the house prices predicted by our models are from the actual house prices
    • MAE: mean absolute error
    • RMSE: root mean square error
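Both error measures are short to write out. In the sketch below the function names `mae` and `rmse` and the three example prices are illustrative assumptions, not values from the data set.

```r
# Mean absolute error: average size of the prediction errors.
mae <- function(actual, predicted) mean(abs(actual - predicted))

# Root mean square error: like MAE, but large errors are penalised more.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Made-up prices in USD, for illustration only.
actual    <- c(200000, 150000, 300000)
predicted <- c(210000, 140000, 270000)

mae(actual, predicted)
rmse(actual, predicted)
```

Here the absolute errors are 10000, 10000 and 30000, so the MAE is 50000 / 3; the RMSE is larger because the single 30000 error is squared before averaging.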

k-fold cross validation

  • \(k\)-fold cross validation estimation:

10-fold cross validation
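A minimal base R sketch of 10-fold cross validation for the simple model is given here. It runs on simulated data (`sim` is hypothetical, with assumed coefficients and noise) and is not necessarily the procedure used to produce the reported error rates.

```r
# Simulated stand-in data; the real analysis uses the house data set.
set.seed(4)
n <- 200
sim <- data.frame(Living.Area = runif(n, 600, 5000))
sim$Price <- 12844 + 113 * sim$Living.Area + rnorm(n, sd = 60000)

k <- 10
fold <- sample(rep(seq_len(k), length.out = n))  # random fold label per row

# For each fold: fit on the other k - 1 folds, measure error on the held-out fold.
cv <- t(sapply(seq_len(k), function(i) {
  train <- sim[fold != i, ]
  test  <- sim[fold == i, ]
  fit   <- lm(Price ~ Living.Area, data = train)
  err   <- test$Price - predict(fit, newdata = test)
  c(RMSE = sqrt(mean(err^2)), MAE = mean(abs(err)))
}))

colMeans(cv)  # cross-validated RMSE and MAE, averaged over the folds
```

Because every observation is held out exactly once, the averaged errors estimate how the model would perform on houses it has not seen.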

Conclusion

  • The stable model may be the best among the four models:
    • It is the most stable (model is selected in \(78\%\) of bootstrap resamples)
    • Assumption plots look fine
    • Still relatively high \(r^2\) (0.633) and adjusted-\(r^2\) (0.632) values
    • Does not have much higher errors than backward and forward models

Future Research

  • As shown, the stable model has somewhat higher error rates and slightly lower \(r^2\) and adjusted-\(r^2\) values than the backward and forward models. It is a compromise between accuracy and stability.

  • If more information about the data set were available, a domain expert could make a better judgement about which model to choose.
